:orphan:

Sklearn Basics 2: Train a Classifier on a Star Multi-Table Dataset
==================================================================

In this notebook, we learn how to train a classifier on a multi-table
dataset composed of two tables (a root table and a secondary table). It is
highly recommended to complete the *Sklearn Basics 1* lesson first if you
are not familiar with Khiops' sklearn estimators.

We start by importing the sklearn estimator ``KhiopsClassifier``:

.. code:: ipython3

    import os
    import pandas as pd
    from khiops import core as kh
    from khiops.sklearn import KhiopsClassifier, train_test_split_dataset
    from sklearn import metrics

    # If there are any issues, you may check the Khiops status with the following command
    # kh.get_runner().print_status()

Training a Multi-Table Classifier
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

We'll train a "sarcasm detector" using the dataset ``HeadlineSarcasm``. In
its raw form, the dataset contains a list of text headlines paired with a
label that indicates whether its source is a sarcastic site (such as *The
Onion*) or not.

We have transformed this dataset into two tables such that the text-label
record

::

   "groundbreaking study finds gratification can be deliberately postponed" yes

is transformed to an entry in a table that contains (id, label) records

::

   97 yes

and various entries in a secondary table linking a headline id to its words
and positions

::

   97 0 groundbreaking
   97 1 study
   97 2 finds
   97 3 gratification
   97 4 can
   97 5 be
   97 6 deliberately
   97 7 postponed

Thus, the ``HeadlineSarcasm`` dataset has the following multi-table schema

::

   +-----------+
   |Headline   |
   +-----------+       +-------------+
   |HeadlineId*|       |HeadlineWords|
   |IsSarcasm  |       +-------------+
   +-----------+       |HeadlineId*  |
        |              |Position     |
        +---1:n------->|Word         |
                       +-------------+

The ``HeadlineId`` variable is special because it is a *key* that links a
particular headline to its words (a ``1:n`` relation).

*Note: There are other methods more appropriate for this text-mining
problem. This multi-table setup is only intended for pedagogical purposes.*

To train the ``KhiopsClassifier`` for this setup, we must specify a
multi-table dataset. Let's first check the content of the created tables:

- The main table ``Headline``
- The secondary table ``HeadlineWords``

.. code:: ipython3

    sarcasm_dataset_dir = os.path.join("data", "HeadlineSarcasm")

    headlines_file = os.path.join(sarcasm_dataset_dir, "Headlines.txt")
    headlines_df = pd.read_csv(headlines_file, sep="\t")
    print("Headlines table (first 10 rows)")
    display(headlines_df.head(10))

    headlines_words_file = os.path.join(sarcasm_dataset_dir, "HeadlineWords.txt")
    headlines_words_df = pd.read_csv(headlines_words_file, sep="\t")
    print("HeadlineWords table (first 10 rows)")
    display(headlines_words_df.head(10))

.. parsed-literal::

    Headlines table (first 10 rows)

.. parsed-literal::

       HeadlineId IsSarcasm
    0           0       yes
    1           1        no
    2          10        no
    3         100       yes
    4        1000       yes
    5       10000        no
    6       10001       yes
    7       10002        no
    8       10003       yes
    9       10004        no

.. parsed-literal::

    HeadlineWords table (first 10 rows)

.. parsed-literal::

       HeadlineId  Position             Word
    0           0         0  thirtysomething
    1           0         1       scientists
    2           0         2           unveil
    3           0         3         doomsday
    4           0         4            clock
    5           0         5               of
    6           0         6             hair
    7           0         7             loss
    8           1         0              dem
    9           1         1             rep.

Before training the classifier, we split the main table into a feature
matrix (only the ``HeadlineId`` column) and a target vector containing the
labels (the ``IsSarcasm`` column).

.. code:: ipython3

    headlines_main_df = headlines_df.drop("IsSarcasm", axis=1)
    y_sarcasm = headlines_df["IsSarcasm"]

You may note that the feature matrix does not contain any *feature*, but do
not worry: the Khiops AutoML engine will automatically create features by
aggregating the columns of ``HeadlineWords`` for each headline (more
details about this below).

Moreover, instead of passing an ``X`` table to the ``fit`` method, we pass
a *multi-table dataset* specification, which is a dictionary with the
following format:

::

   X = {
       "main_table": <main table name>,
       "tables": {
           <table name>: (<dataframe>, <key columns>),
           <table name>: (<dataframe>, <key columns>),
           ...
       }
   }

Note that the key columns of each table are specified as a single name or a
tuple containing the column names composing the key. So for our
``HeadlineSarcasm`` case, we specify the dataset as:

.. code:: ipython3

    X_sarcasm = {
        "main_table": "headlines",
        "tables": {
            "headlines": (headlines_main_df, "HeadlineId"),
            "headline_words": (headlines_words_df, "HeadlineId"),
        },
    }

To separate this dataset into train and test, we use the ``khiops-python``
helper function ``train_test_split_dataset``. This function can split
``dict`` dataset specifications:

.. code:: ipython3

    (
        X_sarcasm_train,
        X_sarcasm_test,
        y_sarcasm_train,
        y_sarcasm_test,
    ) = train_test_split_dataset(X_sarcasm, y_sarcasm)

The call to the ``KhiopsClassifier`` ``fit`` method is very similar to the
single-table case, but this time we specify the additional parameter
``n_features``, which is the number of aggregates that the Khiops AutoML
engine will construct and analyze during the training. Some examples of the
features it will create for ``HeadlineSarcasm`` are:

- Number of different words in the headline
- Most common word in the headline
- Number of times the word 'the' appears
- …

The Khiops AutoML engine will also evaluate, select and combine these
features to build a classifier. Here we request ``1000`` features (the
default is ``100``):

*Note: By default Khiops builds 10 decision tree features. This is not
necessary for this tutorial, so we set* ``n_trees=0``.

.. code:: ipython3

    khc_sarcasm = KhiopsClassifier(n_features=1000, n_trees=0)
    khc_sarcasm.fit(X_sarcasm_train, y_sarcasm_train)

.. parsed-literal::

    KhiopsClassifier(n_features=1000, n_trees=0)
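To build some intuition for the aggregates the engine constructs, here is a
minimal pandas sketch of one such feature, the number of different words in
a headline. The toy table below is a hypothetical stand-in for
``HeadlineWords``, not part of the tutorial's pipeline:

.. code:: ipython3

    import pandas as pd

    # Toy stand-in for the HeadlineWords secondary table
    toy_words_df = pd.DataFrame(
        {
            "HeadlineId": [97, 97, 97, 98, 98],
            "Word": ["groundbreaking", "study", "study", "area", "man"],
        }
    )

    # One aggregate Khiops could construct: the distinct word count per headline,
    # which turns the 1:n secondary table into one value per main-table row
    distinct_words = toy_words_df.groupby("HeadlineId")["Word"].nunique()
    print(distinct_words.to_dict())  # {97: 2, 98: 2}

Khiops builds and scores many such aggregates automatically; this sketch
only illustrates the shape of the computation.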
We quickly check its train accuracy and AUC, as in the previous tutorial:

.. code:: ipython3

    sarcasm_train_performance = (
        khc_sarcasm.model_report_.train_evaluation_report.get_snb_performance()
    )
    print(f"HeadlineSarcasm train accuracy: {sarcasm_train_performance.accuracy}")
    print(f"HeadlineSarcasm train auc     : {sarcasm_train_performance.auc}")

.. parsed-literal::

    HeadlineSarcasm train accuracy: 0.850867
    HeadlineSarcasm train auc     : 0.933792

Now, we use our sarcasm classifier to obtain predictions and probabilities
on the test data:

.. code:: ipython3

    y_sarcasm_test_predicted = khc_sarcasm.predict(X_sarcasm_test)
    probas_sarcasm_test = khc_sarcasm.predict_proba(X_sarcasm_test)

    print("HeadlineSarcasm test predictions (first 10 values):")
    display(y_sarcasm_test_predicted[:10])
    print("HeadlineSarcasm test prediction probabilities (first 10 values):")
    display(probas_sarcasm_test[:10])

.. parsed-literal::

    HeadlineSarcasm test predictions (first 10 values):

.. parsed-literal::

    array(['no', 'no', 'yes', 'yes', 'yes', 'no', 'no', 'no', 'no', 'no'],
          dtype=object)

.. parsed-literal::

    HeadlineSarcasm test prediction probabilities (first 10 values):

.. parsed-literal::

    array([[0.98051026, 0.01948974],
           [0.7168483 , 0.2831517 ],
           [0.48756231, 0.51243769],
           [0.08162827, 0.91837173],
           [0.30038081, 0.69961919],
           [0.8818798 , 0.1181202 ],
           [0.87340021, 0.12659979],
           [0.74002932, 0.25997068],
           [0.96795465, 0.03204535],
           [0.75040413, 0.24959587]])

Finally, we estimate the accuracy and AUC on the test data:

.. code:: ipython3

    sarcasm_test_accuracy = metrics.accuracy_score(y_sarcasm_test, y_sarcasm_test_predicted)
    sarcasm_test_auc = metrics.roc_auc_score(y_sarcasm_test, probas_sarcasm_test[:, 1])

    print(f"Sarcasm test accuracy: {sarcasm_test_accuracy}")
    print(f"Sarcasm test auc     : {sarcasm_test_auc}")

.. parsed-literal::

    Sarcasm test accuracy: 0.8251572327044026
    Sarcasm test auc     : 0.9083383942327357

To further explore the results, we can open the report with the Khiops
Visualization app:

.. code:: ipython3

    # To visualize, uncomment the lines below
    # khc_sarcasm.export_report_file("./sarcasm_report.khj")
    # kh.visualize_report("./sarcasm_report.khj")

Exercise
~~~~~~~~

Repeat the previous steps with the ``AccidentsSummary`` dataset. This
dataset describes the characteristics of traffic accidents that happened in
France in 2018. It has two tables with the following schema:

::

   +---------------+
   |Accidents      |
   +---------------+
   |AccidentId*    |
   |Gravity        |
   |Date           |
   |Hour           |        +---------------+
   |Light          |        |Vehicles       |
   |Department     |        +---------------+
   |Commune        |        |AccidentId*    |
   |InAgglomeration|        |VehicleId*     |
   |...            |        |Direction      |
   +---------------+        |Category       |
        |                   |PassengerNumber|
        +---1:n------------>|...            |
                            +---------------+

For each accident, we have both its characteristics (such as ``Gravity`` or
``Light`` conditions) and those of each involved vehicle (its ``Direction``
or ``PassengerNumber``).

We first load the tables of ``AccidentsSummary`` into dataframes:

.. code:: ipython3

    accidents_dataset_dir = os.path.join(kh.get_samples_dir(), "AccidentsSummary")

    accidents_file = os.path.join(accidents_dataset_dir, "Accidents.txt")
    accidents_df = pd.read_csv(accidents_file, sep="\t", encoding="latin1")
    print("Accidents dataframe (first 10 rows):")
    display(accidents_df.head(10))
    print()

    vehicles_file = os.path.join(accidents_dataset_dir, "Vehicles.txt")
    vehicles_df = pd.read_csv(vehicles_file, sep="\t", encoding="latin1")
    print("Vehicles dataframe (first 10 rows):")
    display(vehicles_df.head(10))

.. parsed-literal::

    Accidents dataframe (first 10 rows):

.. parsed-literal::

         AccidentId    Gravity        Date      Hour               Light  \
    0  201800000001  NonLethal  2018-01-24  15:05:00            Daylight
    1  201800000002  NonLethal  2018-02-12  10:15:00            Daylight
    2  201800000003  NonLethal  2018-03-04  11:35:00            Daylight
    3  201800000004  NonLethal  2018-05-05  17:35:00            Daylight
    4  201800000005  NonLethal  2018-06-26  16:05:00            Daylight
    5  201800000006  NonLethal  2018-09-23  06:30:00      TwilightOrDawn
    6  201800000007  NonLethal  2018-09-26  00:40:00  NightStreelightsOn
    7  201800000008     Lethal  2018-11-30  17:15:00  NightStreelightsOn
    8  201800000009  NonLethal  2018-02-18  15:57:00            Daylight
    9  201800000010  NonLethal  2018-03-19  15:30:00            Daylight

       Department  Commune InAgglomeration IntersectionType    Weather  \
    0         590        5              No           Y-type     Normal
    1         590       11             Yes           Square   VeryGood
    2         590      477             Yes           T-type     Normal
    3         590       52             Yes   NoIntersection   VeryGood
    4         590      477             Yes   NoIntersection     Normal
    5         590       52             Yes   NoIntersection  LightRain
    6         590      133             Yes   NoIntersection     Normal
    7         590       11             Yes   NoIntersection     Normal
    8         590      550              No   NoIntersection     Normal
    9         590       51             Yes           X-type     Normal

                          CollisionType             PostalAddress
    0  2Vehicles-BehindVehicles-Frontal    route des Ansereuilles
    1                       NoCollision  Place du général de Gaul
    2                       NoCollision             Rue nationale
    3                    2Vehicles-Side       30 rue Jules Guesde
    4                    2Vehicles-Side        72 rue Victor Hugo
    5                             Other                       D39
    6                             Other        4 route de camphin
    7                             Other         rue saint exupéry
    8                             Other          rue de l'égalité
    9  2Vehicles-BehindVehicles-Frontal   face au 59 rue de Lille

.. parsed-literal::

    Vehicles dataframe (first 10 rows):

.. parsed-literal::

         AccidentId VehicleId Direction          Category  PassengerNumber  \
    0  201800000001       A01   Unknown         Car<=3.5T                0
    1  201800000001       B01   Unknown         Car<=3.5T                0
    2  201800000002       A01   Unknown         Car<=3.5T                0
    3  201800000003       A01   Unknown  Motorbike>125cm3                0
    4  201800000003       B01   Unknown         Car<=3.5T                0
    5  201800000003       C01   Unknown         Car<=3.5T                0
    6  201800000004       A01   Unknown         Car<=3.5T                0
    7  201800000004       B01   Unknown           Bicycle                0
    8  201800000005       A01   Unknown             Moped                0
    9  201800000005       B01   Unknown         Car<=3.5T                0

           FixedObstacle MobileObstacle ImpactPoint           Maneuver
    0                NaN        Vehicle  RightFront         TurnToLeft
    1                NaN        Vehicle   LeftFront  NoDirectionChange
    2                NaN     Pedestrian         NaN  NoDirectionChange
    3  StationaryVehicle        Vehicle       Front  NoDirectionChange
    4                NaN        Vehicle    LeftSide         TurnToLeft
    5                NaN            NaN   RightSide             Parked
    6                NaN          Other  RightFront          Avoidance
    7                NaN        Vehicle    LeftSide                NaN
    8                NaN        Vehicle  RightFront           PassLeft
    9                NaN        Vehicle   LeftFront               Park

Create the main feature matrix and the target vector for ``AccidentsSummary``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Note that the target variable is ``Gravity``.

.. code:: ipython3

    accidents_main_df = accidents_df.drop("Gravity", axis=1)
    y_accidents = accidents_df["Gravity"]

Create the multi-table dataset specification
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Note that the main table has a single key, ``AccidentId``, whereas the
secondary table has a composite key, ``AccidentId`` and ``VehicleId``.

.. code:: ipython3

    X_accidents = {
        "main_table": "accidents",
        "tables": {
            "accidents": (accidents_main_df, "AccidentId"),
            "vehicles": (vehicles_df, ["AccidentId", "VehicleId"]),
        },
    }

Split the dataset into train and test
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code:: ipython3

    (
        X_accidents_train,
        X_accidents_test,
        y_accidents_train,
        y_accidents_test,
    ) = train_test_split_dataset(X_accidents, y_accidents)

Train a classifier with this dataset
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

- You may choose the number of features ``n_features`` to be created by the
  Khiops AutoML engine
- Set the number of trees to zero (``n_trees=0``)

.. code:: ipython3

    khc_accidents = KhiopsClassifier(n_trees=0, n_features=1000)
    khc_accidents.fit(X_accidents_train, y_accidents_train)

.. parsed-literal::

    KhiopsClassifier(n_features=1000, n_trees=0)
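Conceptually, splitting a multi-table dataset means splitting the rows of
the main table and then keeping only the secondary rows whose key belongs
to the selected main rows. The following is a minimal pandas sketch of that
idea on hypothetical toy tables, not a description of
``train_test_split_dataset``'s actual implementation:

.. code:: ipython3

    import pandas as pd

    # Toy main and secondary tables linked by the AccidentId key
    toy_main_df = pd.DataFrame({"AccidentId": [1, 2, 3, 4]})
    toy_vehicles_df = pd.DataFrame(
        {
            "AccidentId": [1, 1, 2, 3, 4],
            "VehicleId": ["A01", "B01", "A01", "A01", "A01"],
        }
    )

    # Take the first 3 main rows as "train", then filter the secondary
    # table so it only references accidents present in the train split
    train_main_df = toy_main_df.iloc[:3]
    train_vehicles_df = toy_vehicles_df[
        toy_vehicles_df["AccidentId"].isin(train_main_df["AccidentId"])
    ]
    print(len(train_main_df), len(train_vehicles_df))  # 3 4

This is why the helper receives the whole ``dict`` specification: it must
keep the main and secondary tables consistent across the split.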
Print the train accuracy and AUC of the model
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code:: ipython3

    accidents_train_performance = (
        khc_accidents.model_report_.train_evaluation_report.get_snb_performance()
    )
    print(f"AccidentsSummary train accuracy: {accidents_train_performance.accuracy}")
    print(f"AccidentsSummary train auc     : {accidents_train_performance.auc}")

.. parsed-literal::

    AccidentsSummary train accuracy: 0.944343
    AccidentsSummary train auc     : 0.81777

Deploy the classifier to obtain predictions and probabilities on the test data
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code:: ipython3

    y_accidents_test_predicted = khc_accidents.predict(X_accidents_test)
    probas_accidents_test = khc_accidents.predict_proba(X_accidents_test)

    print("Accidents test predictions (first 10 values):")
    display(y_accidents_test_predicted[:10])
    print("Accidents test prediction probabilities (first 10 values):")
    display(probas_accidents_test[:10])

.. parsed-literal::

    Accidents test predictions (first 10 values):

.. parsed-literal::

    array(['NonLethal', 'NonLethal', 'NonLethal', 'NonLethal', 'NonLethal',
           'NonLethal', 'NonLethal', 'NonLethal', 'NonLethal', 'NonLethal'],
          dtype=object)

.. parsed-literal::

    Accidents test prediction probabilities (first 10 values):

.. parsed-literal::

    array([[0.194344  , 0.805656  ],
           [0.00707682, 0.99292318],
           [0.03085459, 0.96914541],
           [0.08640951, 0.91359049],
           [0.01865278, 0.98134722],
           [0.00681306, 0.99318694],
           [0.0062505 , 0.9937495 ],
           [0.17195874, 0.82804126],
           [0.02707476, 0.97292524],
           [0.01174233, 0.98825767]])

Obtain the accuracy and AUC on the test dataset
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code:: ipython3

    accidents_test_accuracy = metrics.accuracy_score(
        y_accidents_test, y_accidents_test_predicted
    )
    accidents_test_auc = metrics.roc_auc_score(
        y_accidents_test, probas_accidents_test[:, 1]
    )
    print(f"Accidents test accuracy: {accidents_test_accuracy}")
    print(f"Accidents test auc     : {accidents_test_auc}")

.. parsed-literal::

    Accidents test accuracy: 0.9472518344178319
    Accidents test auc     : 0.8079238757149434

Explore the report with the Khiops Visualization App
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code:: ipython3

    # To visualize, uncomment the lines below
    # khc_accidents.export_report_file("./accidents_report.khj")
    # kh.visualize_report("./accidents_report.khj")
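Accuracy and AUC summarize performance in a single number; on an imbalanced
target such as ``Gravity`` (mostly ``NonLethal``), a confusion matrix shows
the per-class error breakdown. Here is a small self-contained sketch with
``sklearn.metrics`` on toy labels (hypothetical values, not the actual test
predictions above):

.. code:: ipython3

    from sklearn import metrics

    # Toy true/predicted labels mimicking the Accidents target values
    y_true = ["NonLethal", "NonLethal", "Lethal", "NonLethal", "Lethal"]
    y_pred = ["NonLethal", "NonLethal", "NonLethal", "NonLethal", "Lethal"]

    # Rows are true classes, columns predicted, in the order given by `labels`
    cm = metrics.confusion_matrix(y_true, y_pred, labels=["Lethal", "NonLethal"])
    print(cm)  # [[1 1]
               #  [0 3]]

The same call applied to ``y_accidents_test`` and
``y_accidents_test_predicted`` would show how many ``Lethal`` accidents the
classifier misses despite its high overall accuracy.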